Vision Transformers (ViTs) outperform convolutional neural networks (CNNs) in several vision tasks thanks to their global modeling capability. However, ViT lacks the inductive biases inherent to convolution and therefore requires a large amount of training data. As a result, ViT underperforms CNNs on small datasets, such as those common in medicine and science. We experimentally found that masked autoencoders (MAE) can make the transformer focus more on the image itself, thus alleviating the data-hungry problem of ViT to some extent. However, the current MAE model is too complex and overfits on small datasets, leaving a gap between MAEs trained on small datasets and advanced CNN models. We therefore investigated how to reduce the decoder complexity in MAE and found an architectural configuration better suited to small datasets. In addition, we designed a location prediction task and a contrastive learning task to introduce localization and invariance properties into MAE. Our contrastive learning task not only enables the model to learn high-level visual information but also allows MAE's class token to be trained, something most MAE improvement efforts do not consider. Extensive experiments show that our method achieves state-of-the-art performance on standard small datasets as well as medical datasets with few samples, compared with current popular masked image modeling (MIM) methods and vision transformers for small datasets. The code and models are available at https://github.com/Talented-Q/SDMAE.
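Below is a minimal sketch of the kind of contrastive objective described above, applied to MAE's class token. This is an assumed InfoNCE-style formulation, not the released SDMAE code; the function name, dimensions, and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def cls_token_contrastive_loss(cls_a: torch.Tensor,
                               cls_b: torch.Tensor,
                               temperature: float = 0.2) -> torch.Tensor:
    """cls_a, cls_b: (B, D) class tokens from two augmented views of the same images."""
    a = F.normalize(cls_a, dim=-1)
    b = F.normalize(cls_b, dim=-1)
    logits = a @ b.t() / temperature                 # (B, B) pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Matching view pairs sit on the diagonal; all other pairs act as negatives.
    return F.cross_entropy(logits, targets)

cls_a, cls_b = torch.randn(8, 192), torch.randn(8, 192)
print(cls_token_contrastive_loss(cls_a, cls_b))
```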
Unsupervised foreground-background segmentation aims to extract salient objects from cluttered backgrounds, a setting where Generative Adversarial Network (GAN) approaches, especially layered GANs, show great promise. However, without human annotations, they are typically prone to producing foreground and background layers with non-negligible semantic and visual confusion, dubbed "information leakage", resulting in notable degradation of the generated segmentation masks. To alleviate this issue, we propose a simple yet effective explicit layer-independence modeling approach, termed Independent Layer Synthesis GAN (ILSGAN), which pursues independent foreground-background layer generation by encouraging their discrepancy. Specifically, it minimizes the mutual information between the visible and invisible regions of the foreground and background to spur inter-layer independence. Through in-depth theoretical and experimental analyses, we justify that explicit layer-independence modeling is critical to suppressing information leakage and contributes to impressive segmentation performance gains. Moreover, ILSGAN achieves strong state-of-the-art generation quality and segmentation performance on complex real-world data.
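As a rough sketch of the layered setup this independence term operates on (assumed names and shapes; the mutual-information estimator itself, e.g. a variational bound, is abstracted away here):

```python
import torch

def compose(fg: torch.Tensor, bg: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """fg, bg: (B, 3, H, W) layer images; mask: (B, 1, H, W) alpha in [0, 1]."""
    return mask * fg + (1.0 - mask) * bg

def leakage_pairs(fg, bg, mask):
    # Pairs whose shared information ILSGAN's objective would push toward zero:
    # the visible part of one layer vs. the occluded part of the other.
    fg_visible = mask * fg
    bg_hidden = mask * bg            # background occluded by the foreground
    bg_visible = (1.0 - mask) * bg
    fg_hidden = (1.0 - mask) * fg    # foreground content outside the mask
    return (fg_visible, bg_hidden), (bg_visible, fg_hidden)

fg, bg = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
mask = torch.rand(2, 1, 64, 64)
image = compose(fg, bg, mask)        # layered generation of the final image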
Facial expression recognition (FER) plays a significant role in ubiquitous computer vision applications. We revisit this problem from a new perspective, asking whether useful representations that improve FER performance can be acquired during the image generation process, and propose a novel generative method based on an image inversion mechanism for the FER task, termed Inversion FER (IFER). In particular, we devise a novel Adversarial Style Inversion Transformer (ASIT) for IFER to comprehensively extract features of generated facial images. In addition, ASIT is equipped with an image inversion discriminator that measures the cosine similarity of semantic features between source and generated images, constrained by a distribution alignment loss. Finally, we introduce a feature modulation module that fuses the structural code and latent codes from ASIT for the subsequent FER task. We extensively evaluate ASIT on facial datasets such as FFHQ and CelebA-HQ, showing that our approach achieves state-of-the-art facial inversion performance. IFER also achieves competitive results on facial expression recognition datasets such as RAF-DB, SFEW, and AffectNet. The code and models are available at https://github.com/Talented-Q/IFER-master.
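A minimal sketch of a cosine-similarity term between source and generated features, in the spirit of ASIT's inversion discriminator; the feature extractor is abstracted away, and the function name and shapes are illustrative assumptions rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

def inversion_similarity_loss(feat_src: torch.Tensor,
                              feat_gen: torch.Tensor) -> torch.Tensor:
    """feat_*: (B, D) semantic features of source / generated images.
    Loss shrinks as the generated image's semantics align with the source."""
    cos = F.cosine_similarity(feat_src, feat_gen, dim=-1)   # (B,)
    return (1.0 - cos).mean()

print(inversion_similarity_loss(torch.randn(4, 512), torch.randn(4, 512)))
```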
Compared with the vanilla transformer, the window-based transformer offers a better trade-off between accuracy and efficiency. Although the window-based transformer has made great progress, its long-range modeling capability is limited by the size of the local window and the window connection scheme. To address this problem, we propose a novel Token Transformer (TT). The core mechanism of TT is the addition of a Class (CLS) token that summarizes window information in each local window; we refer to this type of token interaction as CLS attention. These CLS tokens interact spatially with the tokens in each window to enable long-range modeling. To preserve the hierarchical design of the window-based transformer, we design a Feature Inheritance Module (FIM) in each stage of TT to deliver local window information from the previous stage to the CLS tokens of the next stage. In addition, we design a Spatial-Channel Feedforward Network (SCFFN) in TT that mixes CLS tokens and embedded tokens across the spatial and channel domains without additional parameters. Extensive experiments show that TT achieves competitive results with few parameters in image classification and downstream tasks.
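A minimal sketch of per-window CLS attention as described above, using a plain multi-head attention layer: each local window is given its own CLS token, which attends jointly with that window's tokens. The shapes, names, and joint-attention formulation are illustrative assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class WindowCLSAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, windows: torch.Tensor, cls: torch.Tensor):
        # windows: (B * num_windows, window_len, dim); cls: (B * num_windows, 1, dim)
        tokens = torch.cat([cls, windows], dim=1)    # prepend the CLS token
        out, _ = self.attn(tokens, tokens, tokens)   # CLS and window tokens attend jointly
        return out[:, :1], out[:, 1:]                # updated CLS, updated window tokens

layer = WindowCLSAttention(dim=96)
cls, windows = layer(torch.randn(16, 49, 96), torch.randn(16, 1, 96))  # 7x7 windows
```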
Recently, neural network based methods have shown their power in learning more expressive features for knowledge graph embedding (KGE). However, the performance of deep methods often falls behind shallow ones on simple graphs. One possible reason is that deep models are difficult to train, while shallow models may suffice to accurately represent the structure of simple KGs. In this paper, we propose a neural network based model, named DeepE, to address this problem; it stacks multiple building blocks to predict the tail entity from the head entity and the relation. Each building block is the addition of a linear and a non-linear function. The stacked building blocks are equivalent to a group of learning functions with different non-linear depths. Hence, DeepE allows deep functions to learn deep features and shallow functions to learn shallow features. Through extensive experiments, we find that DeepE outperforms other state-of-the-art baseline methods. A major advantage of DeepE is its robustness: DeepE achieves a Mean Rank (MR) score that is 6%, 30%, and 65% lower than the best baseline methods on FB15k-237, WN18RR, and YAGO3-10, respectively. Our design makes it possible to train much deeper networks for KGE, e.g., 40 layers on FB15k-237, without sacrificing precision on simple relations.
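A minimal sketch of the building block as described: each block adds a linear and a non-linear function of its input, so an L-block stack mixes functions of every non-linear depth from 0 to L. The inner widths and layer choices are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DeepEBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)                        # shallow path
        self.nonlinear = nn.Sequential(nn.Linear(dim, dim),      # deep path
                                       nn.ReLU(),
                                       nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The block's output is the sum of a linear and a non-linear map of x.
        return self.linear(x) + self.nonlinear(x)

stack = nn.Sequential(*[DeepEBlock(200) for _ in range(4)])      # 4 stacked blocks
print(stack(torch.randn(8, 200)).shape)                          # torch.Size([8, 200])
```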
Multiview self-supervised representation learning is rooted in exploring semantic consistency across data with complex intra-class variation. Such variation is not directly accessible and is therefore simulated by data augmentations. However, commonly adopted augmentations are handcrafted and limited to simple geometric and color changes, which cannot cover the abundant intra-class variation. In this paper, we propose to extract the underlying data variation from datasets and construct a novel augmentation operator, named local manifold augmentation (LMA). LMA is achieved by training an instance-conditioned generator to fit the distribution on the local manifold of the data and sampling multiview data from it. LMA can create an infinite number of data views, preserve semantics, and simulate complicated variations in object pose, viewpoint, lighting condition, background, etc. Experiments show that with LMA integrated, self-supervised learning methods such as MoCov2 and SimSiam gain consistent improvements on prevalent benchmarks including CIFAR10, CIFAR100, STL10, ImageNet100, and ImageNet. Furthermore, LMA leads to representations with stronger invariance to viewpoint, object pose, and illumination changes, and greater robustness to various real distribution shifts, as reflected by ImageNet-V2, ImageNet-R, ImageNet Sketch, etc.
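A minimal sketch of an instance-conditioned generator in the spirit of LMA: an instance embedding plus noise is mapped to a nearby view, and resampling the noise yields new views of the same instance. The architecture, dimensions, and output resolution are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LMAGenerator(nn.Module):
    def __init__(self, embed_dim: int = 128, noise_dim: int = 64,
                 out_shape: tuple = (3, 32, 32)):
        super().__init__()
        self.noise_dim, self.out_shape = noise_dim, out_shape
        out_pixels = out_shape[0] * out_shape[1] * out_shape[2]
        self.net = nn.Sequential(
            nn.Linear(embed_dim + noise_dim, 512), nn.ReLU(),
            nn.Linear(512, out_pixels), nn.Tanh())

    def forward(self, instance_embed: torch.Tensor) -> torch.Tensor:
        # Condition on the instance; the noise indexes a point on its local manifold.
        noise = torch.randn(instance_embed.size(0), self.noise_dim,
                            device=instance_embed.device)
        out = self.net(torch.cat([instance_embed, noise], dim=-1))
        return out.view(-1, *self.out_shape)

g = LMAGenerator()
views = g(torch.randn(4, 128))   # one sampled view per instance; resample for more
```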
The dialogue policy module is an essential part of task-completion dialogue systems. Recently, increasing attention has focused on dialogue policies based on reinforcement learning (RL). Their favorable performance and wise action decisions rely on accurate estimates of action values. The overestimation problem is a widely known issue in RL: the estimated maximum action value is larger than the ground truth, which leads to an unstable learning process and suboptimal policies. This problem is detrimental to RL-based dialogue policy learning. To mitigate it, this paper proposes a dynamic partial average estimator (DPAV) of the ground-truth maximum action value. DPAV computes a partial average between the predicted maximum and minimum action values, where the weights are dynamically adaptive and problem-dependent. We incorporate DPAV into the dialogue policy and show that our method can achieve better or comparable results on three dialogue datasets of different domains with a lower computational load. In addition, convergence is also proved theoretically, and the upper and lower bounds of the bias are derived in comparison with other methods.
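A minimal sketch of a partial-average TD target in the DPAV spirit: the target mixes the predicted maximum and minimum action values. Here the mixing weight beta is a fixed assumption, whereas in the paper it is dynamically adaptive and problem-dependent:

```python
import torch

def dpav_target(q_next: torch.Tensor, reward: torch.Tensor,
                gamma: float = 0.99, beta: float = 0.7) -> torch.Tensor:
    """q_next: (B, A) next-state action values; reward: (B,) immediate rewards."""
    q_max = q_next.max(dim=-1).values
    q_min = q_next.min(dim=-1).values
    partial_avg = beta * q_max + (1.0 - beta) * q_min  # tempered maximum estimate
    return reward + gamma * partial_avg

print(dpav_target(torch.randn(5, 3), torch.zeros(5)))
```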
Recent work has shown that Pre-trained Language Models (PLMs) store relational knowledge learned from data and utilize it for performing downstream tasks. However, commonsense knowledge may vary across regions. For instance, the color of a bridal dress is white in American weddings, whereas it is red in Chinese weddings. In this paper, we introduce a benchmark dataset, Geo-Diverse Commonsense Multilingual Language Models Analysis (GeoMLAMA), for probing the diversity of relational knowledge in multilingual PLMs. GeoMLAMA contains 3,125 prompts in English, Chinese, Hindi, Persian, and Swahili, with wide coverage of concepts shared by people from American, Chinese, Indian, Iranian, and Kenyan cultures. We benchmark 11 standard multilingual PLMs on GeoMLAMA. Interestingly, we find that 1) larger multilingual PLM variants do not necessarily store geo-diverse concepts better than their smaller variants; 2) multilingual PLMs are not intrinsically biased towards knowledge from Western countries (the United States); 3) the native language of a country may not be the best language with which to probe its knowledge; and 4) a language may probe knowledge about a non-native country better than about its native country. Code and data are released at https://github.com/WadeYin9712/GeoMLAMA.
Scene text recognition is a popular topic that is widely used in industry. Although many methods have achieved satisfactory performance on closed-set text recognition challenges, they lose feasibility in open-set scenarios, where collecting data for or retraining on novel characters can incur a high cost. For example, annotated samples of foreign languages can be expensive, as can retraining the model every time a novel character is discovered in historical documents. In this paper, we introduce and formulate a new open-set text recognition task that requires the ability to spot and recognize novel characters without retraining. A label-to-prototype learning framework is also proposed as a baseline for the proposed task. Specifically, the framework introduces a generalizable label-to-prototype mapping function to build prototypes (class centers) for both seen and unseen classes. An open-set predictor is then used to recognize or reject samples. The rejection capability over out-of-set characters enables automatic spotting of unknown characters in the incoming data stream. Extensive experiments show that our method achieves promising performance on a variety of zero-shot, closed-set, and open-set text recognition datasets.
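A minimal sketch of prototype-based open-set prediction as described above: a sample is assigned to the nearest class prototype, or rejected as a novel character when its best similarity falls below a threshold. The label-to-prototype mapping network is abstracted away, and the threshold and shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def open_set_predict(features: torch.Tensor, prototypes: torch.Tensor,
                     threshold: float = 0.5) -> torch.Tensor:
    """features: (B, D) sample features; prototypes: (C, D) class centers
    built by the mapping function. Returns labels, with -1 meaning rejected."""
    sims = F.normalize(features, dim=-1) @ F.normalize(prototypes, dim=-1).t()
    best, labels = sims.max(dim=-1)
    labels[best < threshold] = -1        # unknown / novel character
    return labels

print(open_set_predict(torch.randn(6, 64), torch.randn(20, 64)))
```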
In recent years, weakly supervised learning has become a popular technique. In this paper, we propose a new medical image classification algorithm, called Weakly Supervised Generative Adversarial Network (WSGAN), which uses only a small number of real images, without labels, to generate fake images or mask images that enlarge the effective sample size of the training set. First, we combine WSGAN with MixMatch to generate fake images and pseudo-labels of unlabeled images for classification. Second, contrastive learning and a self-attention mechanism are introduced into the proposed method to improve classification accuracy. Third, the problem of mode collapse is well addressed by a cycle-consistency loss. Finally, we design global and local classifiers that complement each other with the key information needed for classification. Experimental results on four medical image datasets show that WSGAN achieves relatively high learning performance using only a small amount of labeled and unlabeled data. For example, the classification accuracy of WSGAN is 11% higher than that of MixMatch with 100 labeled images and 1000 unlabeled images on the OCT dataset. In addition, we also conduct ablation experiments to verify the effectiveness of our algorithm.
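A minimal sketch of the MixMatch-style pseudo-labeling step mentioned above: predictions over K augmentations of each unlabeled image are averaged and sharpened into a pseudo-label. K, the temperature, and the dummy model are illustrative assumptions:

```python
import torch

@torch.no_grad()
def pseudo_label(model, views: torch.Tensor, T: float = 0.5) -> torch.Tensor:
    """views: (K, B, C, H, W) augmentations -> (B, num_classes) sharpened targets."""
    probs = torch.stack([model(v).softmax(dim=-1) for v in views]).mean(dim=0)
    sharpened = probs ** (1.0 / T)       # temperature sharpening toward one-hot
    return sharpened / sharpened.sum(dim=-1, keepdim=True)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 4))
views = torch.randn(2, 8, 3, 32, 32)    # K=2 augmentations of 8 unlabeled images
targets = pseudo_label(model, views)    # (8, 4) pseudo-labels for training
```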